ABSTRACT

This  paper  considers  the  problem  of  determining  a  parsimonious  neural  network  for  use 
in  prediction/generalization  based  on  a  given  fixed  learning  sample.  Both  the  classification  and 
nonlinear  regression  contexts  are  addressed.  Following  an  introduction  to  the  problem  and 
survey  of  past  research  on  model  selection  techniques  in  other  statistical  settings,  algorithms  for 
selecting  the  number  of  hidden  layer  nodes  in  a  three  layer,  feedforward  neural  network  are 
presented.  The  selection  criterion  attempts  to  "grow"  the  network  beginning  with  a  small  initial 
number  of  hidden  layer  nodes  (as  opposed  to  pruning  a  relatively  large  network).  For  the 
nonlinear regression problem, the method is based on cross-validation estimates of the mean
prediction error for the candidate networks. For the classification problem, the method is
based  on  resubstitution  estimates  of  the  misclassification  probability  for  the  candidate  networks. 
Also  considered  is  the  use  of  principal  components  analysis  on  the  training  set  in  order  to  reduce 
the  dimensionality  of  the  input  vector  prior  to  "growing"  the  parsimonious  network.  Test  cases 
and  applications  of  the  methods  described  herein  will  be  included  in  a  sequel  (Part  II),  to  be 
published  separately,  to  illustrate  the  effectiveness  of  the  methods. 

1.  INTRODUCTION 


An  artificial  neural  network  (ANN)  can  be  viewed  as  an  analog  computational  device  that 
implements  a  potentially  highly  nonlinear  function.  That  is,  an  ANN  simply  computes  the 
transfer function $g$ in the relationship $y = g(x)$, where $g$ is a suitably well behaved (e.g.
measurable, continuous, differentiable, etc.) mapping from the n-dimensional hypercube $[0,1]^n$
to the m-dimensional real numbers, $R^m$. A typical ANN, with five input nodes (neurons), three
middle layer nodes, and two output layer nodes, is depicted in figure 1. A simplified
explanation of the operation of this type of ANN, known as a single hidden layer feedforward
ANN, is as follows.


The values of the components of an input vector $x$ are "presented" to the "input layer
nodes" of the network. Linear combinations of these values are formed using the
"interconnection weights" $w^{MI}$, and "fed forward" to the "middle layer nodes," each computing a
function $F_m$, typically defined by the logistic "sigmoid" function $F(v) = \exp(v)(1+\exp(v))^{-1}$, on
its input. Linear combinations of these middle layer outputs are then formed using the
interconnection weights $w^{OM}$ and fed forward to the "output layer nodes," where a (typically, the
same) function $F_o$ is applied to produce the final output $y$. Mathematically, the transfer function
$g$ is given by

$$y_i = F_o\Big(\sum_j w^{OM}_{ij}\, F_m\Big(\sum_k w^{MI}_{jk}\, x_k\Big)\Big), \qquad (1)$$

where often, but not necessarily, $F_o = F_m = F$, $y_i$ is the ith component of $y$, and $x_k$ is the kth
component of $x$. A powerful feature of such an ANN is its ability to approximate a wide variety
of transfer functions $g$ by varying the number of input, middle, and output layer nodes and the
corresponding interconnection weights. In fact, it was proved (Kolmogorov (1957)) that if $g$ is
continuous, then $g$ has an exact representation of the type that could be implemented by a neural
network of the type in figure 1, provided that the individual neurons are allowed to compute
possibly different (not necessarily sigmoidal) transfer functions. This theory even
specifies the number of middle layer nodes, $2n+1$, where $n$ is the number of input layer nodes (i.e.
the dimension of the input vector $x$). Using the mathematical tools of functional analysis, it has
been proven that, loosely speaking, fairly general functions $g$ can be approximated to any
desired degree of accuracy using the sigmoidal logistic function $F$ for $F_o$ and $F_m$ in (1), and by
increasing the number of middle layer nodes (see Cybenko (1989), for example).
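
The feedforward computation described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the names `logistic`, `forward`, `w_mi` and `w_om`
are hypothetical, no bias terms are included, the logistic function is used at both layers, and
the weights are arbitrary numbers chosen for the example rather than trained values.

```python
import math


def logistic(v):
    """Logistic sigmoid F(v) = exp(v) / (1 + exp(v)), arranged to avoid overflow."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def forward(x, w_mi, w_om):
    """Output of a single hidden layer feedforward network (no bias terms).

    x    : list of n input values
    w_mi : input-to-middle weights, one row of n weights per middle layer node
    w_om : middle-to-output weights, one row of weights per output layer node
    """
    # Each middle layer node applies F to a linear combination of the inputs.
    middle = [logistic(sum(w * xk for w, xk in zip(row, x))) for row in w_mi]
    # Each output node applies F to a linear combination of the middle layer outputs.
    return [logistic(sum(w * h for w, h in zip(row, middle))) for row in w_om]


# Five inputs, three middle layer nodes, two outputs, as in figure 1 (arbitrary weights).
x = [0.2, 0.4, 0.6, 0.8, 1.0]
w_mi = [[0.1] * 5, [-0.2] * 5, [0.0] * 5]
w_om = [[1.0, -1.0, 0.5], [0.0, 0.0, 0.0]]
y = forward(x, w_mi, w_om)
```

Because the logistic function is applied at the output layer, every component of `y` lies
strictly between 0 and 1; the second output here has all-zero weights and therefore equals F(0).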

Because of this ability to approximate a wide range of multivariate functions, ANNs have
been  used  as  nonlinear  regression  functions  for  the  purpose  of  developing  predictive 
relationships.  A  formal  mathematical  model  for  nonlinear  regression  is 

$$y = g(x; \theta) + \varepsilon \qquad (2)$$

where $y$, $x$, and $\varepsilon$ are jointly distributed random m-, n-, and m-vectors, respectively, $\varepsilon$ is
independent of $y$ and $x$, $E(\varepsilon) = 0$, $\mathrm{Cov}(\varepsilon) = E(\varepsilon\varepsilon^t) = \Sigma$, and $\theta$ is a vector of unknown parameters
(the interconnection weights, in the case of an ANN). See, for example, Seber et al. (1989) and
Gallant (1987) for extensive treatments of models such as (2). One objective in using the model
(2) is to first determine an estimate of the unknown parameter $\theta$ based on a random sample from
the joint distribution of $(x, y)$, and then use this estimate in the model to predict responses $y$ at
new inputs $x$. For ANNs, this procedure is carried out in principle by "training" the network on a
sample of exemplars $(X_1, Y_1), \ldots, (X_N, Y_N)$, where $X_i$ and $Y_i$ are n- and m-dimensional vectors,
respectively. An approach to this training is to determine the interconnection weights that
minimize the mean squared error

$$Q = (1/N) \sum_{i=1}^{N} (Y_i - O_i)^t (Y_i - O_i) \qquad (3)$$

where $O_i$ is the actual output vector displayed on the output layer of the ANN when input vector
$X_i$ is presented to the input layer, and "t" signifies matrix transpose, interpreting the vectors
involved in (3) to be column vectors. Note that $O_i$ will generally differ from $Y_i$ due to random
error (the $\varepsilon$ in equation 2) and the approximation error (since the $g$ in equation 2 may not be
exactly of the parametric form implementable by an ANN; i.e. of the form of equation 1). The
procedure (with variations) that is commonly used in determining the connection weights by
minimizing (3) is the back-propagation algorithm (Soulie et al. (1987), Hecht-Nielsen (1991),
Hertz et al. (1991), and Rumelhart et al. (1986)). A significant advantage of the
back-propagation algorithm is that the network itself can carry out the minimization and
estimation procedure, without external software, making it possible to implement a neural
network completely in hardware or firmware. Once trained on the set of exemplars, the network
is then used to predict new responses based on new inputs. That is, the network is
used to generalize.
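
As a small illustration of the training criterion (3), the following sketch evaluates Q for a
set of target vectors and the corresponding network outputs. The function name
`mean_squared_error` and the numbers are hypothetical and chosen only to make the arithmetic
easy to follow; the sketch shows how Q is evaluated, not how it is minimized.

```python
def mean_squared_error(targets, outputs):
    """Q = (1/N) * sum_i (Y_i - O_i)^t (Y_i - O_i) for two lists of m-vectors."""
    n = len(targets)
    total = 0.0
    for y_i, o_i in zip(targets, outputs):
        # (Y_i - O_i)^t (Y_i - O_i) is the squared Euclidean distance between them.
        total += sum((yk - ok) ** 2 for yk, ok in zip(y_i, o_i))
    return total / n


# Two exemplars with two-dimensional responses (illustrative values only).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]
q = mean_squared_error(targets, outputs)
```

Here the first exemplar contributes 0.25 + 0.25 and the second contributes 0, so Q = 0.5/2 = 0.25.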

Great success has been achieved in using neural networks in this fashion in many
engineering,  economic  /  financial,  and  biomedical  applications.  Recognition  that  the  use  of 
ANNs  in  this  context  fits  within  the  framework  of  nonlinear  regression  analysis  appeared  in  the 
literature  fairly  recently,  although  this  was  apparently  understood  by  ANN  researchers  much 
earlier.  Angus  (1989)  gives  an  exposition  of  this  connection  along  with  an  interpretation  of  the 
back-propagation algorithm as a version of stochastic gradient descent. White's (1989) landmark
paper was the first to show that, in the context of statistical estimation theory, the
back-propagation estimators of the interconnection weights are relatively inefficient (i.e. have
larger variance) compared to standard nonlinear least squares estimators; it also presents a
method (which amounts to taking one Newton-Raphson iteration step from the back-propagation
estimators) for reducing the asymptotic variances of the back-propagation estimators to those of
the nonlinear least squares estimators. Further along the lines of improving the statistical efficiency
of  back  propagation  estimators,  Angus  (1991)  gives  a  Monte  Carlo  data  generation  method  that 
reduces  the  mean  squared  error  of  the  back  propagation  estimators  when  sufficient  statistics  are 
available. 

Another use of ANNs that has shown promise, especially in the area of medical
diagnosis, is in classification. Here, $(X, Y)$ is assumed to be a random sample of size 1 from a
probability distribution $p(A, j)$, where $A$ is a Borel subset of $R^n$, and $j \in \{1, \ldots, K\}$. Here, $j$ is
assumed to signify one of $K$ distinct populations ("classifications", or diagnoses), and,
conditional on $Y=j$, $X$ has a distinct probability distribution on $R^n$. That is,
$P\{X \in A, Y=j\} = p(A, j)$, and $P\{X \in A \mid Y=j\} = \int_A p(dx \mid j)$, where $p(A \mid j) = p(A, j)/\pi(j)$ and
$\pi(j)$ is the marginal probability that $Y=j$ (i.e. that "X comes from population j"). In the
classification problem, $X$ is observed (without its corresponding $Y$) and one must decide the
population from which $X$ came. That is, one must predict $Y$ based on observing $X$. If $p(\cdot \mid j)$ has
a density $f_j$ with respect to Lebesgue measure on $R^n$, and $X=x$ is observed, then the optimum
decision is to classify $X$ into the population $j$ for which
$\pi(j \mid x) = f_j(x)\pi(j) \big/ \sum_{i=1}^{K} f_i(x)\pi(i)$ is maximal. This latter quantity is
$P\{Y=j \mid X=x\}$, and this decision rule is called the Bayes rule. In practice, neither the $\pi(j)$'s nor the
$f_j$'s are known, but a training sample $(X_1, Y_1), \ldots, (X_N, Y_N)$ is available, and an approximation
to the function $\pi(j \mid x)$ is "learned" by an appropriate ANN using, for example, back propagation
to minimize (3). Such an ANN would employ sigmoidal transfer functions at the output nodes
that  restrict  the  outputs  to  lie  in  [0,1]  (such  as  the  logistic  sigmoid  function),  and  the  output  layer 
would  have  m=K  output  nodes.  The  output  Y=j  would  be  designated  by  fixing  the  jth  output 
node  at  1,  and  the  other  nodes  at  0.  The  network  thus  trained  would  then  be  used  to  classify  new 
X  values  into  one  of  the  K  populations.  Other  approaches  to  this  classification  problem  include 
discriminant  analysis,  in  which  the  Xs  are  assumed  to  come  from  one  of  K  different  multivariate 
normal populations, kernel density estimation, kth nearest neighbor rules, and classification and
regression trees (CART). Breiman et al. (1984) is the definitive reference for CART methods,
and  also  gives  a  brief  description  of  the  first  three  methods.  An  ANN  that  directly  attacks  the 
kernel  density  estimation  problem  is  studied  by  Marchette  and  Priebe  (1989). 
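
The Bayes rule described above can be made concrete with a small sketch for the idealized case
in which the priors and class densities are known. Everything specific here is an assumption
made for illustration: two populations, equal priors, and univariate normal densities with
means 0 and 2. In practice these quantities are unknown, which is why an approximation to the
posterior probabilities is learned from the training sample instead.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution (illustrative class density f_j)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(x, priors, densities):
    """pi(j|x) = f_j(x) pi(j) / sum_i f_i(x) pi(i), for j = 1, ..., K."""
    joint = [f(x) * p for f, p in zip(densities, priors)]
    total = sum(joint)
    return [v / total for v in joint]


def bayes_rule(x, priors, densities):
    """Classify x into the population with the largest posterior probability."""
    post = posteriors(x, priors, densities)
    best = max(range(len(post)), key=post.__getitem__)
    return best + 1  # populations are numbered 1, ..., K


priors = [0.5, 0.5]
densities = [lambda x: normal_pdf(x, 0.0, 1.0), lambda x: normal_pdf(x, 2.0, 1.0)]
label_low = bayes_rule(0.3, priors, densities)
label_high = bayes_rule(1.7, priors, densities)
```

With equal priors and equal variances, the rule assigns an observation to the population whose
mean is nearer, so 0.3 is assigned to population 1 and 1.7 to population 2.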

Despite  its  success  in  many  sophisticated  applications,  this  "training-generalization" 
application  of  ANNs  has  a  serious  drawback  analogous  to  the  misspecification  problem 
(underfitting or overfitting) in classical statistical models, as exemplified in figure 2 for the
nonlinear regression application. If the ANN architecture has enough middle layer nodes, then
by forcing the mean squared error (3) to be small enough, the network can be made to collocate
the exemplars exactly. Since the responses $Y_i$ in the exemplar set typically have measurement
error according to the model (2), this means that the ANN can be made to force the
approximating surface to pass through the points $(X_i, g(X_i; \theta) + \varepsilon_i)$, $i = 1, \ldots, N$. This is what is
meant by "fitting the noise," and it leads to a regression surface that is irregular (i.e. "bumpy")
and  hence  poor  at  generalization.  Similarly,  if  there  are  too  few  nodes  in  the  middle  layer,  then 
the  ANN  will  be  a  poor  approximation  to  the  true  regression  surface,  being  able  to  achieve  only 
limited  generalization  capability.  White  (1981)  discusses  the  misspecification  problem  for 
general  models  of  the  form  (2). 


Figure  2.  The  effects  of  overfitting  and  underfitting. 


Selection  of  the  proper  size  of  the  network  is  thus  of  vital  concern  in  applications,  and  is 
the  major  thrust  of  this  paper.  Following  is  a  discussion  of  techniques  that  are  used  in  other 
modeling  contexts  to  select  the  proper  parametric  model,  and  general  principles  that  aid  in 
attacking  the  problem  for  ANNs. 

2.  APPROACHES  AND  PRINCIPLES  IN  MODEL  SELECTION 

Model  selection  has  been  studied  extensively  by  many  researchers  for  a  variety  of 
statistical  models  related  to  (2).  For  intrinsically  linear  models,  selection  is  tantamount  to 
selecting  the  proper  regressor  variables  (e.g.  linear,  quadratic,  cross  product,  etc.)  along  with  the 
dimension of the unknown parameter space. For example, Mallows (1964, 1973) considers the
general linear model, and the Mallows measure $C_p$ is extensively used in computerized stepwise
linear regression packages. See also Myers (1990) for specific implementations and derivations
of $C_p$. Model selection is also of extreme importance in time series models where prediction
and/or interpolation are of interest. See, for example, Shibata (1976), Bhansali (1978), Hannan
et al. (1979), Wei (1987), and Hemerly et al. (1989) for analyses of model selection criteria for
autoregressive and stochastic regression models under minimal distributional assumptions.
Hemerly et al. (1991) have also studied the problem of determining the order of an
autoregressive model when there is no a priori upper bound on this order.

When more specific distributional and parameter information is available (i.e. the
distribution of the error vector $\varepsilon$ in (2), and prior information concerning $\theta$), several authors have
found optimal model selection criteria in a Bayesian context. Among these, see Atkinson (1978)
and Smith et al. (1980). Similarly, Schwarz (1978) has found a fairly simple criterion to apply in
selecting a linear model from a subset of linear models (with bounded dimensions) that is
asymptotically optimal when the observables follow a regular exponential family of
distributions. Schwarz's criterion has been named the Schwarz Information Criterion (SIC), and
will be discussed further later on. Other authors have studied the general problem of model
selection based on various notions of information. See, for example, Akaike (1969, 1973, and
1974) (the "Akaike Information Criterion," or AIC), Stone (1977b, 1978), and Rissanen (1976,
1986).

A  controlling  theme  in  these  investigations  in  model  selection  is  the  determination  of  a 
measure  of  prediction  accuracy,  or  model  fit,  that  takes  into  account  both  prediction  variance 
and  prediction  bias,  the  former  tending  to  increase  with  model  complexity  (e.g.  the  number  of 
terms  in  the  linear  regression  function),  and  the  latter  tending  to  decrease  with  model  complexity. 
Based  on  a  training  sample,  the  "best"  model  is  then  selected  to  achieve  some  optimum  balance 
between  these  competing  quantities  as  estimated  in  some  fashion  from  the  sample.  Naturally,  the 
more  information  available  concerning  the  distributional  structure  of  the  error  term  in  (2),  the 
more  efficient  is  the  model  selection  criterion  in  terms  of  accuracy  and  sample  size  necessary  to 
achieve a decision. For example, both the AIC and the SIC (which is asymptotically optimal) require
that the joint likelihood function of the observables be available and tractable in order to be used.
In contrast, the PLS (Predictive Least Squares) criterion discussed in Hemerly et al. (1989) for
determining the order of an autoregressive model, and Mallows' $C_p$ measure for linear models,
are computable from relatively simple sample characteristics. By nature, however, all the above
methods  are  computationally  intensive,  as  they  require  fitting  many  candidate  models  to  the 
training  sample  in  order  to  make  the  final  selection. 

In the context of classification models, Breiman et al. (1984) give extensive discussion
and methods for growing and pruning classification and regression trees (CARTs) based on
cross-validation estimates of misclassification probabilities. There, the estimated
misclassification  probability  (a  measure  of  accuracy)  is  traded  off  against  the  number  of  terminal 
nodes  of  the  tree  (a  measure  of  complexity). 

Some specific work for ANNs has been done in the area of network size selection. These
approaches generally fall into two categories: those that attempt to "prune" a large network of
connections and/or nodes, and those that attempt to "grow" a larger network starting with a small
network. The work in the latter category is mostly related to ANNs with nodes that take values
in a discrete set (e.g. {0,1} or {-1,1}). See, for example, Marchand et al. (1990), Frean (1990),
Mezard et al. (1989), Sirat et al. (1990), Fahlman et al. (1990), and Gallant (1986). In the first
category, Sietsma et al. (1988), Hinton (1986), Scalettar et al. (1988), Kramer et al. (1989),
Hanson et al. (1989), and Chauvin (1989) have made contributions, attacking the problem by
modifying the training rule (i.e. the back-propagation algorithm) to include a penalty term in (3)
that discourages complexity (complexity increases as the number of interconnections and/or nodes
increases). Other novel approaches to ANN selection and design include the use of genetic
algorithms, which draw analogies with genetic natural selection for evolution and survival in
biological populations (Miller et al. (1989), and Harp et al. (1990)).

These aforementioned approaches are either not suitable for continuum-valued input/output
neurons, or they require extensive modification of the back propagation algorithm, a
luxury that may be neither feasible nor desirable (for example, if the ANN is implemented in a
"canned" computer routine, or in hardware/firmware). In addition, they do not attempt to
optimize  the  network  with  respect  to  some  statistical  measure  of  prediction  error,  and  do  not 
"preprocess"  the  input  vector  before  presenting  it  to  the  ANN.  The  purpose  of  preprocessing 
would  be  to  eliminate  redundant  information  in  the  input  vector,  attempting  to  encompass  most 
of  the  variability  in  the  input  sample  space  with  far  fewer  dimensions.  This  concept  is  used 
successfully  in  human  learning,  the  preprocessing  initially  accomplished  via  teachers  and 
mentors,  until  the  same  level  of  input  discrimination  is  learned  by  the  pupil. 

It  is  therefore  proposed  in  this  paper  that  the  selection  of  an  ANN  for  a  given  application 
and  training  sample  be  accomplished  in  two  stages.  The  first  stage  involves  preprocessing  of  the 
input  data  to  achieve  reduction  in  dimensionality  if  possible.  The  second  stage  is  to  grow  an 
appropriate ANN, without modifying the back propagation algorithm, that is of the "best" size in
terms  of  balancing  prediction  variance  with  prediction  bias  (when  regression  structure  is  present 
as  in  (2))  or  balancing  misclassification  probability  with  a  measure  of  network  complexity  (when 
classification  is  the  goal).  More  will  be  said  concerning  these  tradeoffs  later  on.  It  is  instructive 
at  this  point  to  review  the  principles  behind  some  of  these  past  techniques  (e.g.  of  Schwarz 
(1978)  and  Mallows  (1973))  for  their  pedagogic  value  and  in  order  to  motivate  the  algorithms 
that  will  be  proposed  later  in  this  paper  for  the  second  stage  of  selection  of  an  ANN.  The  first 
stage,  the  preprocessing  of  input  data,  will  be  accomplished  via  principal  components  analysis 
(see  Rao  (1973),  for  example).  A  relatively  new  technique,  SIR  (sliced  inverse  regression,  Li 
(1991)),  is  also  a  feasible  tool  in  achieving  the  dimensionality  reduction  of  inputs  in  the  case 
where the ANN yields 1-dimensional output. Further development of this concept will be a topic
for  future  research. 


Mallows' $C_p$ measure will now be reviewed briefly. For the moment, consider the
full-rank linear model $Y = X\beta + \varepsilon$, where $Y$ is an $N \times 1$ observation vector, $X$ is an $N \times m$ matrix of
"independent" variables, $\beta$ is an $m \times 1$ vector of unknown parameters, and $\varepsilon$ is an $N \times 1$ error term
that satisfies $E(\varepsilon) = 0$, $\mathrm{Cov}(\varepsilon) = \sigma^2 I$. It is well known that the least squares estimator of $\beta$ is
$\hat\beta = (X^tX)^{-1}X^tY$, and this estimator is the best linear unbiased estimator of $\beta$. Suppose $0 < p < m$, and
that $X$ and $\beta$ are partitioned as $X = (X_1 | X_2)$, $\beta^t = (\beta_1^t | \beta_2^t)$, so that $Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$, where $X_1$ is
$N \times p$, $X_2$ is $N \times (m-p)$, $\beta_1$ is $p \times 1$, and $\beta_2$ is $(m-p) \times 1$, and that we fit an underspecified model by
assuming that $\beta_2 = 0$, estimating $\hat\beta_1 = (X_1^tX_1)^{-1}X_1^tY$. Denote the rows of $X_1$ by $x_1^t, \ldots, x_N^t$. Then
the predicted value of $y_i$, the ith component of $Y$, based on this fitted model, is $\hat y_i = x_i^t\hat\beta_1$. The
total expected prediction error, summed over the data points and normalized by $\sigma^2$, is

$$E\Big(\sum_{i=1}^{N} (\hat y_i - E y_i)^2\Big) \Big/ \sigma^2. \qquad (4)$$

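Each summand decomposes as $E(\hat y_i - E y_i)^2 = \mathrm{var}(\hat y_i) + \mathrm{bias}^2(\hat y_i)$, where
$\mathrm{bias}(\hat y_i) = E\hat y_i - E y_i$. Notice that $\mathrm{var}(\hat y_i) = x_i^t \mathrm{Cov}(\hat\beta_1) x_i = \sigma^2 x_i^t (X_1^tX_1)^{-1} x_i$ and, since
$\sum_{i=1}^{N} x_i x_i^t = X_1^tX_1$,

$$\sum_{i=1}^{N} x_i^t (X_1^tX_1)^{-1} x_i = \sum_{i=1}^{N} \mathrm{tr}[(X_1^tX_1)^{-1} x_i x_i^t] = \mathrm{tr}[(X_1^tX_1)^{-1}(X_1^tX_1)] = p,$$

where "tr" denotes matrix trace, so that (4) becomes

$$p + \sum_{i=1}^{N} \mathrm{bias}^2(\hat y_i) \Big/ \sigma^2. \qquad (5)$$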


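Writing the vector of biases as $E(X_1\hat\beta_1) - X\beta = E(X_1\hat\beta_1) - X_1\beta_1 - X_2\beta_2$, the bias term in (5)
can be written as

$$\sum_{i=1}^{N} \mathrm{bias}^2(\hat y_i) = (E(X_1\hat\beta_1) - X_1\beta_1 - X_2\beta_2)^t (E(X_1\hat\beta_1) - X_1\beta_1 - X_2\beta_2)$$

$$= (X_2\beta_2)^t (I - X_1(X_1^tX_1)^{-1}X_1^t) X_2\beta_2, \qquad (6)$$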

with (6) following since the matrix in the quadratic form is idempotent. Notice now that the
expected value of the error sum of squares in fitting the underspecified model is given by

$$E(SSE) = E((Y - X_1\hat\beta_1)^t (Y - X_1\hat\beta_1)) = \mathrm{tr}[E((I - X_1(X_1^tX_1)^{-1}X_1^t) YY^t)]$$

$$= \mathrm{tr}[(I - X_1(X_1^tX_1)^{-1}X_1^t)(\sigma^2 I + X\beta\beta^tX^t)] = \sigma^2(N-p) + (X_2\beta_2)^t (I - X_1(X_1^tX_1)^{-1}X_1^t) X_2\beta_2$$

$$= \sigma^2(N-p) + \sum_{i=1}^{N} \mathrm{bias}^2(\hat y_i),$$

so that, letting $s^2 = SSE/(N-p)$, (4) becomes

$$p + (N-p)(E(s^2) - \sigma^2)/\sigma^2. \qquad (7)$$

If an independent estimate of $\sigma^2$ is available, call it $\hat\sigma^2$, then (7) can be estimated by the $C_p$
statistic (Mallows (1973)) given by

$$C_p = p + (N-p)(s^2 - \hat\sigma^2)/\hat\sigma^2. \qquad (8)$$

The importance of the statistic (8) is that it expresses the tradeoff between the number of terms in
the regression model ($p$) and the prediction bias, and it is used in selecting the best regression
model by selecting the model with the lowest $C_p$ value among candidate models. This selection
is usually carried out graphically by fitting the candidate models, plotting $C_p$ versus $p$, and
selecting the model whose $C_p$ value is closest to the line $C_p = p$.
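
The computation in (8) can be illustrated with a short sketch. The data, the helper names
(`solve`, `sse`, `mallows_cp`) and the nested candidate models (intercept; intercept and x;
intercept, x and x squared) are hypothetical. The sketch also assumes that the estimate of the
error variance is taken from the largest candidate model, a common practical choice that is an
assumption here rather than something prescribed above.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sse(rows, y):
    """Error sum of squares after a least squares fit of y on the columns in rows."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bk * xk for bk, xk in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def mallows_cp(rows, y, p, sigma2_hat):
    """C_p = p + (N - p)(s^2 - sigma2_hat) / sigma2_hat, using the first p columns."""
    n = len(y)
    s2 = sse([r[:p] for r in rows], y) / (n - p)
    return p + (n - p) * (s2 - sigma2_hat) / sigma2_hat


# Hypothetical data: a straight line plus a fixed "noise" pattern.
xs = [i / 10.0 for i in range(10)]
noise = [0.05, -0.03, 0.02, -0.04, 0.01, 0.03, -0.02, 0.04, -0.05, -0.01]
y = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]
rows = [[1.0, x, x * x] for x in xs]  # candidate regressors: 1, x, x^2

m = len(rows[0])
sigma2_hat = sse(rows, y) / (len(y) - m)  # assumption: variance estimate from the full model
cp = {p: mallows_cp(rows, y, p, sigma2_hat) for p in range(1, m + 1)}
```

With this choice of variance estimate, the full model has $C_p = m$ by construction, so the
comparison of interest is among the smaller models; the intercept-only model, which is badly
biased for these data, receives a very large $C_p$.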

The PRESS (prediction sum of squares) method (see Myers (1991), ch. 4, for example)
is based on the principle of considering the prediction error of a fitted model (i.e. fitted to a
training sample) when used to predict the value of a new, independent exemplar. This principle
also underlies the derivation of Akaike's (1974) AIC and (1969) FPE (final prediction error)
criteria, as well as the PLS criterion studied by Hemerly et al. (1989, 1991). The basic idea is to
define a measure of the prediction error for the fitted model with respect to a new, independent
exemplar, and then develop a cross-validation estimator of the measure by splitting the training
sample into a fitting sample and a validation sample. The PRESS method is described as
follows.


Assume the same linear model setup as in the discussion of Mallows' $C_p$ statistic. For
each $i$, $i = 1, \ldots, N$, remove the pair $(x_i, y_i)$ from the training sample and fit the least squares
estimator of $\beta$ based on the $N-1$ remaining points. Call the resulting estimator $\hat\beta_{-i}$, to designate
that the ith data point has been removed. Form the prediction $\hat y_{i,-i} = x_i^t\hat\beta_{-i}$ of $y_i$, $i = 1, \ldots, N$. Note
that neither $x_i$ nor $y_i$ has been used in determining $\hat\beta_{-i}$. The PRESS residuals are defined by
$e_{i,-i} = y_i - \hat y_{i,-i}$, $i = 1, \ldots, N$, and the PRESS is defined to be

$$\mathrm{PRESS} = \sum_{i=1}^{N} e_{i,-i}^2.$$

PRESS contains both components of prediction variance and prediction bias, as does $C_p$. An
advantage of the PRESS residuals is that they are particularly sensitive to points where
prediction is poor, while ordinary residuals (which measure empirical fit) are not. In fact, it can
be shown that the PRESS residuals are related to the ordinary residuals $e_i = y_i - x_i^t\hat\beta$ by the formula
$e_{i,-i} = e_i/(1 - x_i^t(X^tX)^{-1}x_i)$ as follows. Let $h_{ii} = x_i^t(X^tX)^{-1}x_i$. Then, by definition, the ith PRESS
residual is

$$e_{i,-i} = y_i - x_i^t [X^tX - x_ix_i^t]^{-1} (X^tY - x_iy_i).$$

By the Sherman-Morrison-Woodbury Theorem (Rao (1973)),

$$[X^tX - x_ix_i^t]^{-1} = (X^tX)^{-1} + \frac{(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}}{1 - h_{ii}},$$

so that, writing $\hat y_i = x_i^t(X^tX)^{-1}X^tY$ for the ordinary fitted value,

$$e_{i,-i} = y_i - x_i^t(X^tX)^{-1}X^tY + x_i^t(X^tX)^{-1}x_iy_i - \frac{x_i^t(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}X^tY}{1-h_{ii}} + \frac{x_i^t(X^tX)^{-1}x_ix_i^t(X^tX)^{-1}x_iy_i}{1-h_{ii}}$$

$$= \frac{y_i(1-h_{ii}) - \hat y_i(1-h_{ii}) + h_{ii}(1-h_{ii})y_i - h_{ii}\hat y_i + h_{ii}^2y_i}{1-h_{ii}} = \frac{y_i - \hat y_i}{1-h_{ii}} = \frac{e_i}{1-h_{ii}}.$$

Thus, the PRESS can be computed without fitting the N "leave-one-out" regressions by the
formula

$$\mathrm{PRESS} = \sum_{i=1}^{N} \Big(\frac{e_i}{1-h_{ii}}\Big)^2,$$

and the PRESS residuals are seen to be ordinary residuals, inflated by $(1-h_{ii})^{-1}$, where
$h_{ii} = x_i^t(X^tX)^{-1}x_i$ is, apart from a missing factor of $\sigma^2$, the ordinary prediction variance. Hence,
where prediction is poor (i.e. $h_{ii}$ close to 1), the PRESS residual is greatly inflated. When
several candidate models are under consideration, the one with the smallest PRESS is the model
of choice under this criterion.
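
The equivalence between the leave-one-out definition of PRESS and the shortcut formula can be
checked numerically. The sketch below is illustrative only: the data are hypothetical, the
helper names (`solve`, `dot`, `normal_equations`, `press_leave_one_out`, `press_shortcut`) are
not from any library, and the model is a simple straight-line regression with an intercept.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def normal_equations(rows, y):
    """Return X^t X and X^t Y for the design matrix whose rows are given."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return xtx, xty


def press_leave_one_out(rows, y):
    """PRESS from its definition: refit with each point removed in turn."""
    total = 0.0
    for i in range(len(y)):
        xtx, xty = normal_equations(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        beta = solve(xtx, xty)
        total += (y[i] - dot(rows[i], beta)) ** 2
    return total


def press_shortcut(rows, y):
    """PRESS from a single fit: sum of (e_i / (1 - h_ii))^2."""
    xtx, xty = normal_equations(rows, y)
    beta = solve(xtx, xty)
    total = 0.0
    for xi, yi in zip(rows, y):
        h_ii = dot(xi, solve(xtx, xi))  # h_ii = x_i^t (X^t X)^{-1} x_i
        total += ((yi - dot(xi, beta)) / (1.0 - h_ii)) ** 2
    return total


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
rows = [[1.0, x] for x in xs]
direct = press_leave_one_out(rows, y)
shortcut = press_shortcut(rows, y)
```

The two values agree up to floating point rounding. The shortcut depends on the linearity of
the model; for a nonlinear model such as an ANN no such identity is available, and each
leave-one-out fit would have to be carried out explicitly.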


Mallows' $C_p$ and the PRESS residual methods were derived under minimal
distributional assumptions on the error vector $\varepsilon$, but with the fairly strong assumption that the
regression was intrinsically linear. Essential use was made of this fact in computing $C_p$, and for
deriving a simple computational formula for the PRESS statistic that avoids fitting all the linear
regressions. When more specific information is available, the optimality of these selection
procedures can be improved, and an (asymptotically) optimal procedure can be derived. This
was accomplished by Schwarz (1978) as follows.

Suppose that $X_1, \ldots, X_N$ are a random sample from a regular exponential family of
distributions with probability density, with respect to a $\sigma$-finite measure $\mu$ on the sample space,
given by $f(x; \theta) = \exp(\theta^t y(x) - \eta(\theta))$, where $\theta$ ranges over the natural parameter space $\Theta$, a convex
subset of the d-dimensional Euclidean space, and $y$ is the sufficient d-dimensional statistic. The
competing models are assumed to be defined by restricting the parameter space to subsets of the
original $\Theta$ of the form $L_j \cap \Theta$, where each $L_j$ is a $d_j$-dimensional linear submanifold of
d-dimensional Euclidean space, $0 < d_j < d$, $j \in J$, where $J$ is a finite index set. Suppose that $\theta$ has an
a priori probability measure of the form $\tau\{d\theta\} = \sum_{j \in J} \alpha_j \tau_j\{d\theta\}$, where $\alpha_j = P\{\text{model } j \text{ is correct}\}$,
and $\tau_j\{\cdot\}$, the conditional a priori distribution of $\theta$ given that model $j$ is correct, has a
nonsingular $d_j$-dimensional density on $L_j \cap \Theta$ with respect to $d_j$-dimensional Lebesgue measure
that is bounded and locally bounded away from zero throughout $L_j \cap \Theta$. Notice that it is being
assumed that the total number of possible models is finite, and that there is an upper bound,
namely $d$, on the dimensionality of the model. In this Bayesian context, the optimum decision is
to choose the model with the highest posterior probability. By Bayes' formula, the posterior
probability that model $j$ is correct, given $X_1, \ldots, X_N$, is


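$$P\{\text{model } j \text{ is correct} \mid X_1, \ldots, X_N\}$$

$$= \alpha_j \int_{L_j \cap \Theta} \exp(N(\theta^tY - \eta(\theta)))\, \tau_j\{d\theta\} \Big[\sum_{i \in J} \alpha_i \int_{L_i \cap \Theta} \exp(N(\theta^tY - \eta(\theta)))\, \tau_i\{d\theta\}\Big]^{-1}, \qquad (9)$$

where $Y = (1/N)\sum_{i=1}^{N} y(X_i)$. Since the normalizing constant is the same for each $j$, and since
$x \mapsto \ln(x)$ is a monotone increasing mapping for $x > 0$, the optimum decision is to choose the model
corresponding to $j$ having the largest value of

$$S(Y, N, j) = \ln(\alpha_j) + \ln\Big(\int_{L_j \cap \Theta} \exp(N(\theta^tY - \eta(\theta)))\, \tau_j\{d\theta\}\Big). \qquad (10)$$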

The  asymptotic  behavior  of  (10)  is  of  interest.  An  asymptotic  expansion  of  (10)  is  easy  to  derive 
arguing  heuristically.  A  rigorous  derivation  is  given  in  Schwarz  (1978). 


From the theory of asymptotic expansions of integrals of the form in (10), it follows that
the asymptotic behavior of the integral is determined by that of the integral over a small
neighborhood about the value of $\theta$ at which $\theta^tY - \eta(\theta)$ takes on its maximum in $L_j \cap \Theta$. Call this
value $\theta_0$. Expanding $\theta^tY - \eta(\theta)$ in a Taylor series about $\theta_0$, recognizing that the linear term in the
expansion vanishes since $\theta_0$ yields a maximum, it follows that for $\theta$ near $\theta_0$, with $\theta, \theta_0 \in L_j \cap \Theta$,

$$\theta^tY - \eta(\theta) \approx \theta_0^tY - \eta(\theta_0) - (1/2)(\theta - \theta_0)^t I_0 (\theta - \theta_0),$$

where $I_0 = \partial^2\eta(\theta)/\partial\theta\,\partial\theta^t \big|_{\theta = \theta_0}$ is nonnegative definite, since it is the covariance matrix of $y(X_1)$ when
$\theta = \theta_0$. Using this in (10) gives


S(Y,  N,  j)  -  N  suPq^j^  ^q(0V-ii(0))  -  ln(  Vtj{d0}) 

^  j 

Let $f_j$ be the density of $\tau_j$. Because of the assumptions on $L_j$, and by making a linear
transformation of the integration variables, the last written integral is asymptotic to

$$c\, f_j(\theta_0)\, N^{-d_j/2},$$

where $c$ is a positive constant. Using this in (10) yields

$$S(Y, N, j) \sim N \sup_{\theta \in L_j \cap \Theta} \big(\theta^{T}Y - \eta(\theta)\big) - (d_j/2)\ln(N), \quad \text{as } N \to \infty. \tag{11}$$

Notice that the supremum term in (11) is just the maximum of the log-likelihood, the maximum being
taken over $L_j \cap \Theta$. Denoting the maximum over $\theta \in L_j \cap \Theta$ of the likelihood function by
$M_j(X_1, \ldots, X_N)$, the Schwarz criterion is, from (11), to choose the model j for which

$$\ln(M_j(X_1, \ldots, X_N)) - (d_j/2)\ln(N) \tag{12}$$

is maximized, $j \in J$.


Akaike (1974) defines a similar criterion to (12), called the AIC, which amounts to
choosing the model j having the maximum value of

$$\ln(M_j(X_1, \ldots, X_N)) - d_j. \tag{13}$$

Of course, the work of Schwarz (1978) and the preceding discussion show that (13) is not
asymptotically optimum in the aforementioned setting. Because of the ln(N) multiplier in (12),
the  SIC  tends  to  favor  lower  dimensional  models  than  the  AIC,  and  in  fact,  several  authors  have 
observed  that  the  AIC  tends  to  overestimate  the  dimension  of  the  model  (Shibata  (1976),  Jones 
(1975)). 
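
The contrast between (12) and (13) can be made concrete with a short calculation. The sketch
below is only illustrative: the function names and the maximized log-likelihood values are
hypothetical, not taken from this report. It scores three made-up candidate models with both
penalties and records which model each criterion selects.

```python
import math

def sic_score(max_log_lik, dim, n):
    """Schwarz criterion (12): maximized log-likelihood minus (d_j / 2) ln N."""
    return max_log_lik - 0.5 * dim * math.log(n)

def aic_score(max_log_lik, dim):
    """Akaike criterion (13): maximized log-likelihood minus d_j."""
    return max_log_lik - dim

def select_model(candidates, n):
    """candidates maps a model label to (maximized log-likelihood, dimension)."""
    by_sic = max(candidates, key=lambda j: sic_score(*candidates[j], n))
    by_aic = max(candidates, key=lambda j: aic_score(*candidates[j]))
    return by_sic, by_aic

# Hypothetical fitted models: label -> (ln M_j, d_j). The numbers are made up.
candidates = {"small": (-120.0, 2), "medium": (-117.5, 4), "large": (-116.8, 7)}
by_sic, by_aic = select_model(candidates, n=200)
```

With these made-up values and N = 200, the (d_j/2) ln(N) penalty selects the 2-dimensional
model while the d_j penalty selects the 4-dimensional one, in line with the remark that the SIC
tends to favor lower dimensional models than the AIC.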


Despite the pedagogic value of these considerations, it is clear that neither Mallows' $C_p$
nor the SIC is directly applicable to the determination of the best size for a neural network
based on a given set of training data. Conceptually, however, they suggest that a good procedure
would be based on selecting a network that achieves a balance between the closeness with which
the  model  fits  the  training  data,  and  the  "dimensionality"  of  the  approximating  regression 
surface. 

3.  ANN  SELECTION  FOR  THE  NONLINEAR  REGRESSION  PROBLEM 

In this section an algorithm is proposed for selecting the size of a single
hidden layer feedforward neural network as described in section 1 and figure 1. The "size" of the
network will be defined to be the number of hidden layer nodes, the input layer and output layer
sizes being dictated by the dimensions of the $X_i$ (input) and $Y_i$ (output) vectors. No attempt is
being made here to limit the number of interconnections for a given network size. This section
treats the case in which regression structure is present.

Suppose, as in section 1, that a random training sample $(X_1, Y_1), \ldots, (X_N, Y_N)$ is
available. Here, it is assumed that $X_i$ and $Y_i$ are jointly distributed n- and m-dimensional
random vectors, respectively. Suppose that $Y_i = g_p(X_i;\theta) + e_i$, where $E(e_i) = 0$,
$\mathrm{cov}(e_i) = \Sigma$, $i = 1, \ldots, N$, and that the $e_i$'s are independent and identically
distributed. The function $g_p$ is assumed to belong to the class of functions that are represented
by a single hidden layer feedforward neural network with p hidden nodes and given node transfer
functions (e.g. logistic sigmoid functions). Thus, $g_p$ depends also on the (m+n)p interconnection
weights of the network, represented by the vector $\theta$. To be definite, assume that the back
propagation algorithm is used in fitting the interconnection weights for a given value of p. This
assumption is not essential, as any suitable numerical method for finding the weights based on a
given training sample will suffice.

For an ANN of size p, assume that the weights have been fit based on the training
sample, and that a new exemplar (X, Y), independent of the training set
$T = ((X_1, Y_1), \ldots, (X_N, Y_N))$, is available. As usual, it is assumed that T constitutes a
random sample from the joint distribution of (X,Y). Define the prediction mean squared error by

$$\mathrm{PMSE} = E\big(\|g_p(X;\hat\theta) - g_{p_0}(X;\theta)\|^2\big), \tag{14}$$


where $p_0$ is the correct size for the network, $\theta$ is the true value of the weight vector, and
$\|\cdot\|$ is the Euclidean norm on $R^m$. In (14), $g_p(X;\hat\theta)$ is the predicted value of Y based on
knowledge of X, using the ANN that implements the transfer function $g_p(\cdot;\hat\theta)$. The hat
indicates that the interconnection weights in $g_p$ have been estimated from the training sample, so
that $g_p(\cdot;\hat\theta)$ contains the training sample information through its dependence on these
estimated weights. The value $g_{p_0}(X;\theta)$ is the expected value of Y given X, since
$g_{p_0}(\cdot;\theta)$ is the true regression function. The prediction mean squared error (14) contains
both prediction variance and bias components analogous to (4) in the derivation of Mallows' $C_p$.
In fact ("tr" indicates matrix trace),

$$\mathrm{PMSE} = E\,\mathrm{tr}\big[\mathrm{Cov}(g_p(X;\hat\theta) \mid X)\big] + E\,\mathrm{tr}\big[\mathrm{Bias}(g_p(X;\hat\theta) \mid X)\,\mathrm{Bias}^{T}(g_p(X;\hat\theta) \mid X)\big] = P_N(p),$$

the latter notation used to indicate the dependence on both N and p, and if m=1 (i.e. the response
variable Y is 1-dimensional), then

$$\mathrm{PMSE} = E\big(\mathrm{var}(g_p(X;\hat\theta) \mid X)\big) + E\big(\mathrm{Bias}^2(g_p(X;\hat\theta) \mid X)\big).$$


Unfortunately, $g_{p_0}$ is unknown, and there are no tractable analytical calculations, as in
the case of $C_p$, that render (14) usable for estimating p. The technique of ordinary
cross-validation (OCV) estimation can be used to estimate (14). This method, also known as the
leave-one-out method, is implemented as follows.


For each $k \in \{1, \ldots, N\}$, remove the exemplar $(X_k, Y_k)$ from the training set, and train the
network on the remaining N-1 exemplars. Let $\hat g_p^{(k)}$ be the resulting estimated transfer function.
Note that $\hat g_p^{(k)}$ is statistically independent of $(X_k, Y_k)$. Conditional on $X_k$, $Y_k$ is an
unbiased estimator of $g_{p_0}(X_k;\theta)$. Hence, the OCV estimator of (14) is

$$V(p) = \frac{1}{N}\sum_{k=1}^{N} \big\|\hat g_p^{(k)}(X_k) - Y_k\big\|^2. \tag{15}$$

The OCV estimator V(p) is biased as an estimator of the PMSE. In fact,

$$E(V(p)) = P_{N-1}(p) + \mathrm{tr}(\Sigma),$$

where $P_{N-1}(p)$ is the PMSE for a training sample of size N-1, but the term $\mathrm{tr}(\Sigma)$ is a
constant, so that V(p) is still a measure of prediction bias plus variance, the former tending to
increase for p decreasing away from $p_0$, while the latter tends to increase for p increasing away
from $p_0$. Hence, a reasonable estimate of $p_0$ based on the training sample is

$$\hat p_0 = \arg\min_p V(p), \tag{16}$$

that is, the value of p that minimizes V(p) with respect to p.
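
The leave-one-out procedure in (15) and (16) can be sketched as follows. This is a minimal
illustration for a scalar response (m = 1), not the report's implementation: `train` stands for
any routine that fits a model of size p and returns a predictor, and the piecewise-constant
`train_binned_mean` is a hypothetical stand-in for back propagation training of an ANN with p
hidden nodes, chosen only so that the example runs quickly without external libraries.

```python
import random

def ocv_score(train, p, xs, ys):
    """Leave-one-out estimate V(p) of (15) for scalar responses (m = 1)."""
    n = len(xs)
    total = 0.0
    for k in range(n):
        # Remove exemplar k and fit on the remaining N-1 exemplars.
        xs_k = xs[:k] + xs[k + 1:]
        ys_k = ys[:k] + ys[k + 1:]
        predict = train(p, xs_k, ys_k)
        total += (predict(xs[k]) - ys[k]) ** 2
    return total / n

def select_size(train, sizes, xs, ys):
    """Return (p_hat, scores): the minimizer of V(p) as in (16), plus every V(p)."""
    scores = {p: ocv_score(train, p, xs, ys) for p in sizes}
    return min(scores, key=scores.get), scores

def train_binned_mean(p, xs, ys):
    """Stand-in for ANN training (assumption): piecewise-constant fit, p bins on [0, 1)."""
    sums, counts = [0.0] * p, [0] * p
    for x, y in zip(xs, ys):
        b = min(int(x * p), p - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)          # fallback prediction for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * p), p - 1)]

# Synthetic data (made up): a noisy step function on [0, 1).
random.seed(0)
xs = [random.random() for _ in range(60)]
ys = [(1.0 if x > 0.5 else 0.0) + random.gauss(0.0, 0.1) for x in xs]
p_hat, scores = select_size(train_binned_mean, [1, 2, 4, 8, 16], xs, ys)
```

Here the size p = 1 is too small to represent the step and so carries a large bias, while large p
leaves few exemplars per bin and inflates the variance; which of the adequate sizes attains the
minimum of V(p) depends on the particular noise draw.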

OCV estimators of various prediction figures of merit have been extensively and
successfully used in many contexts. See, for example, Breiman et al. (1984) and Breiman (1991)
for their use in classification and regression trees and regression splines, Wahba (1990) for their
use in spline models for observational data, and Myers (1990, ch. 4) for their use in selecting a
regression function and a comparison with Mallows' $C_p$ criterion. Theoretical work concerning
cross-validation estimation has been carried out by Stone (1974), (1977a) and (1977b).

Use of (16) in this fashion requires, of course, that potentially many values of V(p) be
computed, for various values of p, in order to locate the minimum value and corresponding p.
Convergence of the weight estimation algorithm (i.e. the back propagation algorithm in this case)
in computing the predictors $\hat g_p^{(k)}$ for each fixed p and k varying from 1 to N should be fairly
rapid, as the weights computed from the previous value of k can be used as starting values for the
algorithm for the next k. Nevertheless, this procedure is computationally intensive, and it is
imperative that an attempt be made to reduce the dimensionality of the input vector prior to
attempting to determine the best ANN via (16). Approaches to this are discussed in section 5.

4.  ANN  SELECTION  FOR  THE  CLASSIFICATION  PROBLEM 

In this section, ANN selection is considered in the context of the classification problem.
Here, the training set of exemplars $T = ((X_1, Y_1), \ldots, (X_N, Y_N))$ constitutes a random sample
from the distribution $p(A, j) = P\{X \in A, Y = j\}$, where A is a Borel set in $R^n$, and
$j \in \{1, \ldots, K\}$ represents K distinct populations. The interpretation of an exemplar $(X_k, Y_k)$
is that the measurement variable $X_k$ was generated from a subject in the population $Y_k$. The ANN
is thus being used to estimate the function $\pi(j|x) = P\{Y = j \mid X = x\}$.


As  described  in  section  3,  an  ANN  of  size  p  will  constitute  a  three  layer,  single  hidden 
layer  ANN  with  p  nodes  in  the  middle  layer,  n  input  nodes,  and  m=K  output  nodes.  The  output 
neurons  will  be  assumed  to  implement  a  sigmoidal  transfer  function  that  guarantees  that  each 
output  node  outputs  a  value  in  [0,1].  The  output  Y=j  will  be  presented  to  the  network  during 
training  by  fixing  the  jth  output  node  at  1,  and  all  other  output  nodes  at  0.  A  nontraining  output 
from  the  ANN  will  be  judged  to  constitute  "Y=j"  if  the  jth  output  node  has  the  largest  output 
value. This convention is made in lieu of constraining the output node values to sum to 1.
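
The target encoding and the largest-output convention described above can be sketched in a few
lines. The function names are hypothetical and the output values are made up; the example only
illustrates that the decision depends on which output node is largest, not on the outputs
summing to 1.

```python
def one_hot_target(j, num_classes):
    """Training target for Y = j: output node j fixed at 1, all other nodes at 0."""
    return [1.0 if k == j else 0.0 for k in range(1, num_classes + 1)]

def classify(outputs):
    """Decision for a non-training input: 1-based index of the largest output node.

    Ties are broken in favor of the lowest index (an assumption of this sketch).
    """
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return best + 1

target = one_hot_target(2, 3)          # [0.0, 1.0, 0.0]
label = classify([0.20, 0.35, 0.90])   # 3, although these outputs sum to 1.45
```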

As  in  section  3,  it  will  be  assumed  that  the  ANNs  are  trained  based  on  T  using  the  back 
propagation  algorithm,  and  that  no  attempt  is  made  to  limit  the  number  of  connections  ("zero 
out"  weights)  within  a  network  of  a  given  size. 

Suppose a network of size p has been trained on the set T, yielding the decision rule $\hat d_p$.
That is, $\hat d_p$ is a mapping from $R^n$ to $\{1, \ldots, K\}$ with the interpretation that if X=x is
observed, then classify X into population $\hat d_p(x)$. By the previous discussion, $\hat d_p(x) = j$ if,
when presented with input x, the ANN output node j produces the largest value over all output
nodes. Let a new sample become available, (X, Y). The misclassification probability MP* is
defined by

$$MP^*(N,p) = P\{\hat d_p(X) \neq Y \mid T\}, \tag{17}$$

the conditional probability (conditional on the training sample T) that the rule $\hat d_p$ fails to
correctly classify the new sample. Since the joint distribution of (X,Y) is unknown, (17) cannot
be computed. The resubstitution estimate of (17) is

$$MP(N,p) = \frac{1}{N}\sum_{i=1}^{N} I\{\hat d_p(X_i) \neq Y_i\}. \tag{18}$$

In (18), I{S}=1 if S is true, and I{S}=0 otherwise. Notice that in (18), the rule $\hat d_p$ is determined

with  the  same  data  used  to  estimate  the  probability  of  misclassification.  Therefore,  MP  in  (18)  is 
likely  to  give  overly  optimistic  estimates  of  (17),  and  would  therefore  not  be  appropriate  by  itself 
as  a  figure  of  merit  in  sizing  the  ANN. 
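
The optimism of the resubstitution estimate can be seen in a small sketch. Everything here is
illustrative: the data are made up, and a nearest-neighbor rule is used as a hypothetical
stand-in for a very flexible trained ANN, since it reproduces every training label exactly.

```python
def error_rate(rule, xs, ys):
    """Fraction of exemplars misclassified; on the training set this is (18)."""
    return sum(1 for x, y in zip(xs, ys) if rule(x) != y) / len(xs)

def nearest_neighbor_rule(train_xs, train_ys):
    """A deliberately flexible stand-in rule that memorizes the training set."""
    def rule(x):
        i = min(range(len(train_xs)), key=lambda k: abs(train_xs[k] - x))
        return train_ys[i]
    return rule

# Made-up scalar inputs with labels in {1, 2}.
train_xs = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]
train_ys = [1, 1, 2, 2, 2, 1]
rule = nearest_neighbor_rule(train_xs, train_ys)

mp_resub = error_rate(rule, train_xs, train_ys)   # resubstitution estimate

# Made-up exemplars that were not used to build the rule.
new_xs, new_ys = [0.15, 0.5, 0.85, 0.95], [1, 2, 2, 2]
mp_new = error_rate(rule, new_xs, new_ys)
```

In this toy case the resubstitution estimate is 0 while half of the new exemplars are
misclassified, which is the kind of gap that motivates adding a complexity penalty rather than
minimizing (18) directly.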

Drawing analogy with the work of Breiman et al. (1984) for CARTs, and borrowing their
notation and terminology, define the cost-complexity measure

$$R_\alpha(N,p) = MP(N,p) + \alpha p, \tag{19}$$

where $\alpha \geq 0$ is called the complexity parameter. Because of overfitting to the training data, MP
will tend to decrease as p, the size (complexity) of the ANN, increases. Conversely, MP will
tend to increase as p decreases. Thus, assuming that the true conditional probability function
$\pi(j|x)$ belongs to the parametric family of functions that are implemented by the ANN, the ANN
true size $p_0$ can be estimated by the value that minimizes (19). That is, the selection rule would
be to choose the network size $\hat p_0$ such that

$$\hat p_0 = \arg\min_p R_\alpha(N,p). \tag{20}$$

At present, there are no guidelines for choosing $\alpha$, the complexity cost per hidden layer node in
(20). This will be a topic of study in the sequel (Part II) to this study.
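
The effect of the complexity parameter in (19) and (20) can be illustrated with a short sketch.
The resubstitution errors below are made-up numbers, not results from this study, and the
function names are hypothetical; the point is only to show how the selected size shrinks as the
cost per hidden layer node grows.

```python
def cost_complexity(mp, p, alpha):
    """Cost-complexity measure (19): resubstitution error plus alpha per hidden node."""
    return mp + alpha * p

def select_size_by_cost(mp_by_size, alpha):
    """Selection rule (20): the size p minimizing R_alpha(N, p).

    Ties go to the size listed first in mp_by_size (an assumption of this sketch).
    """
    return min(mp_by_size, key=lambda p: cost_complexity(mp_by_size[p], p, alpha))

# Hypothetical resubstitution errors MP(N, p) for p = 1, ..., 6 (made-up numbers).
mp_by_size = {1: 0.30, 2: 0.18, 3: 0.10, 4: 0.07, 5: 0.06, 6: 0.06}
chosen = {alpha: select_size_by_cost(mp_by_size, alpha)
          for alpha in (0.0, 0.02, 0.05, 0.2)}
```

With no penalty the rule simply picks the size with the smallest resubstitution error; as the
penalty increases, the chosen size moves from 5 to 4 to 3 and finally to 1 for these values.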

5.  PREPROCESSING  OF  THE  TRAINING  SAMPLE 

Let $(X_1, Y_1), \ldots, (X_N, Y_N)$ be the training sample as defined in section 3. Typically, the
dimension of the $X_i$'s, namely n, is fairly large. Moreover, it is often the case that there is strong
linear association between some of the components of the X's (e.g. pulse rate and respiration rate)
in which case, some of the information contained in the X vector may be redundant. In order to
eliminate some of this redundancy and, in effect, reduce the dimensionality of the $X_i$'s, the
following method, called principal components analysis, can be used. This principal components
technique is successfully used to eliminate multicollinearity in linear regression models (see
Myers (1990), ch. 8, for example).

Let $\hat\Sigma$ be the sample covariance matrix of the $X_i$'s, given by

$$\hat\Sigma = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar X)(X_i - \bar X)^{T}, \qquad \bar X = \frac{1}{N}\sum_{i=1}^{N} X_i.$$

Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ be the ordered eigenvalues of $\hat\Sigma$ and $v_1, \ldots, v_n$ the
corresponding set of orthonormal eigenvectors. Fix a threshold $\gamma \in (0,1)$ (typically, $\gamma = .9$ or .95)
and select $n_0 < n$ to be the smallest integer such that

$$\sum_{i=1}^{n_0} \lambda_i \Big/ \sum_{i=1}^{n} \lambda_i \geq \gamma.$$

Let V be the $n_0 \times n$ matrix whose $i$th row is given by $v_i^{T}$. Let $Z_i = V X_i$, $i = 1, \ldots, N$. Each
$Z_i$ represents a "reduced" version of $X_i$, the extraneous $n - n_0$ dimensions being eliminated
because they are associated with small eigenvalues of $\hat\Sigma$, which are in turn associated with high
degrees of multicollinearity (redundancy) in the components of the $X_i$'s. The new training sample
to be used in the ANN is now given by $(Z_1, Y_1), \ldots, (Z_N, Y_N)$.
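
The preprocessing steps of this section can be sketched in code. The sketch assumes that $n_0$ is
chosen as the smallest number of leading eigenvalues whose sum is at least a fraction $\gamma$ of the
total, and, to stay free of external libraries, it uses a textbook cyclic Jacobi rotation scheme
for the eigendecomposition. The function names and data are hypothetical; in practice a linear
algebra library routine would normally be used for the eigenvalues and eigenvectors.

```python
import math

def covariance(xs):
    """Sample covariance matrix of the rows of xs, with divisor N as in the text."""
    n_obs, dim = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n_obs for j in range(dim)]
    return [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in xs) / n_obs
             for b in range(dim)] for a in range(dim)]

def jacobi_eigen(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues and orthonormal eigenvectors of a symmetric matrix (cyclic Jacobi)."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # columns p, q: A <- A J and V <- V J
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):   # rows p, q: A <- J^T A
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    # Eigenvalue i sits on the diagonal; its eigenvector is column i of V.
    return [a[i][i] for i in range(n)], [[v[k][i] for k in range(n)] for i in range(n)]

def pca_reduce(xs, gamma=0.95):
    """Return (zs, n0): projections Z_i = V X_i and the number of retained components."""
    eigvals, eigvecs = jacobi_eigen(covariance(xs))
    order = sorted(range(len(eigvals)), key=lambda i: eigvals[i], reverse=True)
    total, running, n0 = sum(eigvals), 0.0, 0
    for i in order:
        running += eigvals[i]
        n0 += 1
        if running >= gamma * total:
            break
    rows = [eigvecs[i] for i in order[:n0]]   # rows of the n0 x n matrix V
    zs = [[sum(r[j] * x[j] for j in range(len(x))) for r in rows] for x in xs]
    return zs, n0

# Made-up data: the second component is exactly twice the first (fully redundant),
# and the third component is a small alternating perturbation.
xs = [[i / 19, 2 * i / 19, 0.01 * (-1) ** i] for i in range(20)]
zs, n0 = pca_reduce(xs, gamma=0.95)
```

For this made-up sample nearly all of the variance lies along a single direction, so one
component is retained and each 3-dimensional input is replaced by a single coordinate.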

6.  SUMMARY  AND  CONCLUSIONS 

Promising approaches have been presented for choosing the best size of a single hidden
layer feedforward ANN in both the nonlinear regression context and the classification
context. Although restriction to this type of ANN architecture has been assumed, it is not viewed
as a limitation in applications since there is compelling theoretical evidence that such an
architecture maintains sufficient potential for functional approximation. These approaches have
been designed to be applicable to an ANN without modifying its training rule (e.g. back
propagation) or basic architecture (i.e. single hidden layer feedforward type). Other approaches,
whereby the ANN size selection is embedded into the learning algorithm itself, were not
addressed in this investigation. The effectiveness of the methods proposed will be the topic of
the sequel to this report, Part II, in which simulation examples will be used to determine if the
methods select the correct (or nearly correct) size of ANNs in both the nonlinear regression and
classification contexts.

In  order  to  render  the  proposed  size  selection  algorithms  more  computationally  feasible, 
an  approach  has  been  proposed  for  reducing  the  dimensionality  of  the  input  data  to  an  ANN.  This 
approach,  the  principal  components  approach,  has  been  successful  in  classical  statistical  models 
having similar structure, and in many types of applications it effectively eliminates strong
multicollinearities in high dimensional input data vectors (see Myers (1990), ch. 8, section 4, and
Press  (1981),  ch.  9,  section  4,  for  example).  This  application  of  principal  components  to  ANNs 
will  often  greatly  reduce  the  potential  size  of  the  ANN,  thereby  reducing  the  computation  time 
entailed  in  applying  the  size  selection  algorithms  presented  herein. 

The  size  selection  criteria  presented  herein  have  been  chosen  based  on  proven  principles 
in  other  statistical  modeling  problems  (e.g.  CARTs,  time  series  models,  and  regression  models). 
There  are  many  factors  that  will  have  an  effect  on  their  ultimate  performance  in  applications, 
however. For example, it has been assumed that back propagation will be used to train the
ANNs in applications. When using this method, the error threshold in minimizing (3) is typically
selected by the user, and will have an effect on the ANN weights. The sensitivity of the size
selection criteria presented here to this error threshold will need to be investigated. Other factors
that require investigation include selection of the initial size of network for a given problem, the
selection of the complexity cost per hidden layer node parameter in the classification problem,
and the statistical properties of the proposed size estimators, $\hat p_0$.

Finally,  the  criteria  proposed  here  may  perform  well  in  simulation  examples  where  the 
data  are  actually  generated  via  a  nonlinear  function  of  the  type  implemented  by  an  ANN  (i.e.  of 
the parametric form given by equation 1). In reality ANNs produce, at best, approximations to
naturally occurring functions. These naturally occurring functions are generally not of the exact
parametric form of those that can be implemented by an ANN (see (1)), and hence they do not
have their own "true" $p_0$ values associated with them (recall that the definition of $p_0$ in section 3
tacitly assumes that the unknown function is actually of the parametric form of equation 1).
Nevertheless, finding the best size $p_0$ for an approximating ANN is often still achievable. Indeed,

this  difficulty  (i.e.  that  of  the  underlying  natural  model  not  being  of  the  a  priori  assumed 
parametric  form)  pervades  statistical  modeling  in  general,  and  yet  "idealized,"  parsimoniously 
determined  parametric  models  continue  to  provide  useful  and  important  answers  to  quantitatively 
posed  questions.  Therefore,  further  investigations  into  the  effectiveness  (and  other  open 
questions)  of  the  selection  criteria  for  simulated  and  real  data  will  be  important  and  worthwhile 
undertakings. 


REFERENCES 


Akaike,  H.  (1969).  Fitting  autoregression  models  for  prediction.  Ann.  Inst.  Statist.  Math.  21,  pp. 
243-247. 

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
2nd International Symposium on Information Theory, pp. 267-281. Budapest: Akademiai
Kiado.

Akaike,  H.  (1974).  A  new  look  at  the  statistical  model  identification.  IEEE  Transactions  on 
Automatic  Control  AC-19,  pp.  716-723. 

Angus, J.E. (1989). On the connection between neural network learning and multivariate
nonlinear least squares estimation. The International Journal of Neural Networks, vol. 1, no.
1, pp. 42-47.

Angus,  J.E.  (1991).  Computer-assisted  improvement  of  the  estimation  mean  squared  error  with 
application  to  back  propagation  neural  networks.  Unpublished,  under  editorial  review. 

Atkinson,  A.C.  (1978).  Posterior  probabilities  for  choosing  a  regression  model.  Biometrika  65, 
39-48. 

Bhansali,  R.J.  and  Downham,  D.Y.  (1977).  Some  properties  of  the  order  of  an  autoregressive 
model  selected  by  a  generalization  of  Akaike’s  EPF  criterion.  Biometrika  64,  pp.  547-551. 

Breiman, L. (1991). The Π method for estimating multivariate functions from noisy data.
Technometrics 33(2), pp. 125-143 (with discussion).

Breiman,  L.,  Friedman,  J.H.,  Olshen,  R.A.,  and  Stone,  C.  J.  (1984).  Classification  and 
Regression  Trees.  Monterey:  Wadsworth  and  Brooks/Cole. 

Chauvin,  Y.  (1989).  A  back-propagation  algorithm  with  optimal  use  of  hidden  units.  In 
Advances  in  Neural  Information  Processing  Systems  I  (Denver  1988),  ed.  D.S.  Touretzky, 
pp.  519-526.  San  Mateo:  Morgan  Kaufmann. 

Cybenko,  G.  (1989).  Approximation  by  superposition  of  a  sigmoidal  function.  Mathematics  of 
Control,  Signals,  and  Systems  2,  pp.  303-314. 

Durrett, R. (1990). Probability: Theory and Applications. Belmont, CA: Brooks/Cole.

Fahlman, S.E. and Lebiere, C. (1990). The cascade-correlation learning architecture. In
Advances in Neural Information Processing Systems II (Denver 1989), ed. D.S. Touretzky, pp.
524-532. San Mateo: Morgan Kaufmann.

Frean,  M.  (1990).  The  upstart  algorithm:  a  method  for  constructing  and  training  feedforward 
neural  networks.  Neural  Computation  2,  198-209. 

Gallant,  A.R.  (1987).  Nonlinear  Statistical  Models.  New  York:  John  Wiley  and  Sons. 

Gallant,  S.I.  (1986).  Optimal  linear  discriminants.  In  Eighth  International  Conference  on 
Pattern  Recognition  (Paris  1986),  pp.  849-852.  New  York:  IEEE. 

Hannan,  E.J.  and  Quinn,  B.G.  (1979).  The  determination  of  the  order  of  an  autoregression. 
Journal  of  the  Royal  Statistical  Society  B  41(2),  pp.  190-195. 


Hanson,  S.J.  and  Pratt,  L.  (1989).  A  comparison  of  different  biases  for  minimal  network 
construction  with  back-propagation.  In  Advances  in  Neural  Information  Processing  Systems 
I  (Denver  1988),  ed.  D.S.  Touretzky,  pp.  177-185.  San  Mateo:  Morgan  Kaufmann. 

Harp,  S.A.,  Samad,  T.,  and  Guha,  A.  (1990).  Designing  application-specific  neural  networks 
using  genetic  algorithm.  In  Advances  in  Neural  Information  Processing  Systems  II  (Denver 
1989),  ed.  D.S.  Touretzky,  pp.  447-454.  San  Mateo:  Morgan  Kaufmann. 

Hecht-Nielsen, R. (1991). Neurocomputing. Reading, Mass.: Addison Wesley.

Hemerly,  E.M.  and  Davis,  M.H.A.  (1989).  Strong  consistency  of  the  PLS  criterion  for  order 
determination  of  autoregressive  processes.  Annals  of  Statistics  17(2),  pp.  941-946. 

Hemerly, E.M. and Davis, M.H.A. (1991). Recursive order estimation of autoregressions
without bounding the model set. Journal of the Royal Statistical Society B 53(1), pp.
201-210.


Hertz,  J.,  Krogh,  A.,  and  Palmer,  R.G.  (1991).  Introduction  to  the  Theory  of  Neural 
Computation.  Reading,  Mass.:  Addison  Wesley. 

Hinton,  G.E.  (1986).  Learning  distributed  representations  of  concepts.  In  Proceedings  of  the 
Eighth  Annual  Conference  of  the  Cognitive  Science  Society  (Amherst  1986),  pp.  1-12. 
Hillsdale:  Erlbaum. 

Jones,  R.H.  (1975).  Fitting  autoregressions.  Journal  of  the  American  Statistical  Association  70, 
pp.  590-592. 

Kolmogorov, A.N. (1957). On the representation of continuous functions of many variables by
superposition of continuous functions of one variable and addition [in Russian]. Dokl. Akad.
Nauk USSR 114, pp. 953-956.

Kramer, A.H. and Sangiovanni-Vincentelli, A. (1989). Efficient parallel learning algorithms for
neural networks. In Advances in Neural Information Processing Systems I (Denver 1988),
ed. D.S. Touretzky, pp. 40-48. San Mateo: Morgan Kaufmann.

Li, K.C. (1991). Sliced inverse regression for dimension reduction. Journal of the American
Statistical Association 86(414), pp. 316-327 (with discussion).

Mallows,  C.L.  (1964).  Choosing  variables  in  a  linear  regression:  a  graphical  aid.  Presented  at 
the  Central  Regional  Meeting  of  the  Institute  of  Mathematical  Statistics,  Manhattan,  Kansas. 

Mallows,  C.L.  (1973).  Some  comments  on  Cp.  Technometrics  15,  pp.  661-675. 

Marchette,  D.J.  and  Priebe,  C.E.  (1989).  The  adaptive  kernel  neural  network.  Technical 
document  1676,  Naval  Ocean  Systems  Center,  San  Diego,  CA  92152-5000. 

Marchand, M., Golea, M., and Rujan, P. (1990). A convergence theorem for sequential learning
in two layer perceptrons. Europhysics Letters 11, pp. 487-492.

Mezard,  M.  and  Nadal,  J.-P.  (1989).  Learning  in  feedforward  layered  networks:  the  tiling 
algorithm.  Journal  of  Physics  A  22,  2191-2204. 


Miller,  G.F.,  Todd,  P.M.,  and  Hegde,  S.U.  (1989).  Designing  neural  networks  using  genetic 
algorithms.  In  Proceedings  of  the  Third  International  Conference  on  Genetic  Algorithms 
(Arlington  1989),  ed.  J.D.  Schaffer,  pp.  379-384.  San  Mateo:  Morgan  Kaufmann. 

Myers, R.H. (1990). Classical and Modern Regression with Applications. Boston: PWS-Kent.

Press,  S.J.  (1981).  Applied  Multivariate  Analysis:  Using  Bayesian  and  Frequentist  Methods  of 
Inference.  Malabar,  Florida:  Robert  E.  Krieger. 

Rao,  C.R.  (1973).  Linear  Statistical  Inference  and  its  Applications.  2nd  ed.  New  York:  John 
Wiley  &  Sons. 

Rissanen,  J.  (1976).  Modeling  by  shortest  data  description.  Automatica  14,  pp.  465-471. 

Rissanen,  J.  (1986).  Stochastic  complexity  and  modeling.  Annals  of  Statistics  14(3),  pp. 
1080-1100. 

Rumelhart, D.E. and McClelland, J.L. and the PDP Research Group (1986). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. Cambridge:
MIT Press.

Scalettar,  R.  and  Zee,  A.  (1988).  Emergence  of  grandmother  memory  in  feed  forward  networks: 
learning  with  noise  and  forgetfulness.  In  Connectionist  Models  and  Their  Implications: 
Readings  from  Cognitive  Science,  eds.  D.  Waltz  and  J.A.  Feldman,  pp.  309-332.  Norwood: 
Ablex. 

Schwarz,  G.  (1978).  Estimating  the  dimension  of  a  model.  Annals  of  Statistics  6(2),  pp. 
462-464. 

Seber,  G.A.F.  and  Wild,  C.J.  (1989).  Nonlinear  Regression.  New  York:  John  Wiley  and  Sons. 

Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information
criterion. Biometrika 63, pp. 117-126.

Sietsma,  J.  and  Dow,  R.J.F.  (1988).  Neural  net  pruning  -  why  and  how.  In  IEEE  International 
Conference  on  Neural  Networks  (San  Diego,  1988),  vol.  I,  pp.  325-333.  New  York:  IEEE. 

Sirat,  J.-A.,  and  Nadal,  J.-P.  (1990).  Neural  trees:  a  new  tool  for  classification.  Preprint, 
Laboratoires  d’Electronique  Philips,  Limeil-Brevannes,  France. 

Smith,  A.F.M.  and  Spiegelhalter,  D.J.  (1980).  Bayes  factors  and  choice  criteria  for  linear 
models.  Journal  of  the  Royal  Statistical  Society  42(2),  pp.  213-220. 

Soulie,  F.F.,  Robert,  Y.  and  Tchuente,  M.,  eds.  (1987).  Automata  Networks  in  Computer 
Science:  Theory  and  Applications.  Princeton:  Princeton  University  Press. 

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of
the Royal Statistical Society B 36, pp. 111-147.

Stone,  M.  (1977a).  Asymptotics  for  and  against  cross-validation.  Biometrika  64,  pp.  29-35. 

Stone, M. (1977b). An asymptotic equivalence of choice of model by cross-validation and
Akaike's criterion. Journal of the Royal Statistical Society B 39, pp. 44-47.


Stone,  M.  (1978).  Comments  on  model  selection  criteria  of  Akaike  and  Schwarz.  Journal  of  the 
Royal  Statistical  Society  B  41,  pp.  276-278. 

Wahba,  G.  (1990).  Spline  Models  for  Observational  Data.  Philadelphia:  SIAM. 

Wei,  C.Z.  (1987).  Adaptive  prediction  by  least  squares  predictors  in  stochastic  regression 
models  with  applications  to  time  series.  Annals  of  Statistics  15(4),  pp.  1667-1682. 

White,  H.  (1984).  Asymptotic  Theory  for  Econometricians.  New  York:  Academic  Press. 

White,  H.  (1981).  Consequences  and  detection  of  misspecified  nonlinear  regression  models. 
Journal  of  the  American  Statistical  Association  76(374),  pp.  419-433. 

White, H. (1989). Some asymptotic results for learning in single hidden-layer feedforward
network models. Journal of the American Statistical Association 84(408), pp. 1003-1013.


UNCLASSIFIED

SECURITY CLASSIFICATION OF THIS PAGE

REPORT DOCUMENTATION PAGE

1a. Report security classification: Unclassified
1b. Restrictive markings: N/A
2a. Security classification authority: N/A
2b. Declassification/downgrading schedule: N/A
3. Distribution/availability of report: Approved for public release; distribution unlimited.
4. Performing organization report number(s): Report No. 91-16
6a. Name of performing organization: Naval Health Research Center
6b. Office symbol: Code 22
6c. Address: P.O. Box 85122, San Diego, CA 92186-5122
7a. Name of monitoring organization: Chief, Bureau of Medicine and Surgery
7b. Address: Navy Department, Washington, DC 20372-5120
8a. Name of funding/sponsoring organization: Naval Medical Research & Development Command
8c. Address: NNMC, Bethesda, MD 20889-5044
9. Procurement instrument identification number: American Society for Engineering Education
   (ASEE) Navy Summer Faculty Research Program
11. Title: CRITERIA FOR CHOOSING THE BEST NEURAL NETWORK: PART I
12. Personal author(s): Angus, J.E., Ph.D.
13a. Type of report: Final
18. Subject terms: Regression, Classification, Overfitting, Underfitting, Principal Components
19. Abstract:

An investigation into the problem of determining a parsimonious neural network for use
in prediction/generalization based on a given fixed learning sample was undertaken. Both the
classification and nonlinear regression contexts were addressed. An exposition and survey of the
problem and past research on model selection techniques in other statistical settings was
compiled, and algorithms for selecting the number of hidden layer nodes in a three layer,
feedforward neural network were developed. The selection criteria developed attempt to "grow"
the networks beginning with a small initial number of hidden layer nodes (as opposed to pruning
a relatively large network). For the nonlinear regression problem, the method is based on
cross-validation estimates of the prediction mean squared error for the candidate networks. For
the classification problem, the method is based on a cost complexity measure of the candidate
networks based on resubstitution estimates of the probability of misclassification and a penalty
function of the number of hidden layer nodes. Also considered was the use of principal

21. Abstract security classification: Unclassified
22a. Name of responsible individual: William Pueh
22b. Telephone: 619-553-8403
22c. Office symbol: Code 22

DD FORM 1473, 84 MAR. 83 APR edition may be used until exhausted. All other editions are obsolete.